Attribute Relation Extraction from Template-inconsistent Semi-structured Text by Leveraging Site-level Knowledge
Abstract
A variety of methods have been proposed for attribute-value extraction from semi-structured text with consistent templates (strict semi-text). However, when the templates in semi-structured text are inconsistent (weak semi-text), these methods perform poorly. To overcome this template-inconsistency problem, we propose a novel method that leverages site-level knowledge for attribute-value extraction. First, we use a graph-based random walk model to acquire site-level knowledge. Then we use this knowledge to identify weak semi-text in each page and extract attribute-value pairs. Experiments show that, compared with a baseline method that does not use site-level knowledge, our method improves extraction performance significantly.
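The abstract does not say how the graph is built or how the walk is run, so the following is only an illustrative sketch of what a graph-based random walk over a site could look like, not the authors' model. It assumes a bipartite graph linking pages to the candidate attribute strings found on them, a few seed attributes, and a random walk with restart; the function name `random_walk_scores`, the `restart` parameter, and the toy page data are all hypothetical.

```python
from collections import defaultdict


def random_walk_scores(page_candidates, seeds, restart=0.15, iterations=50):
    """Score candidate attribute names with a random walk with restart.

    page_candidates: dict mapping a page id to the set of candidate
        attribute strings seen on that page (assumed input format).
    seeds: candidate strings assumed to be genuine attributes.
    """
    # Undirected bipartite graph: page nodes <-> candidate nodes.
    neighbors = defaultdict(set)
    for page, candidates in page_candidates.items():
        for cand in candidates:
            neighbors[("page", page)].add(("cand", cand))
            neighbors[("cand", cand)].add(("page", page))

    seed_nodes = [("cand", s) for s in seeds if ("cand", s) in neighbors]
    if not seed_nodes:
        raise ValueError("no seed appears in the graph")
    restart_mass = {node: 1.0 / len(seed_nodes) for node in seed_nodes}

    scores = dict(restart_mass)
    for _ in range(iterations):
        nxt = defaultdict(float)
        # Each node spreads (1 - restart) of its mass evenly to its neighbors.
        for node, mass in scores.items():
            share = (1.0 - restart) * mass / len(neighbors[node])
            for nb in neighbors[node]:
                nxt[nb] += share
        # The remaining probability mass jumps back to the seed nodes.
        for node, mass in restart_mass.items():
            nxt[node] += restart * mass
        scores = dict(nxt)

    # Report only candidate nodes; page nodes are intermediate.
    return {name: s for (kind, name), s in scores.items() if kind == "cand"}


# Toy site: three pages whose candidate strings partly overlap.
pages = {
    "p1": {"Price", "Weight", "Brand"},
    "p2": {"Price", "Weight", "Free shipping today"},
    "p3": {"Price", "Brand"},
}
scores = random_walk_scores(pages, seeds={"Price"})
```

In this toy setup, a string that co-occurs with the seed on several pages ("Brand") ends up with a higher score than a one-off string ("Free shipping today"), which is the kind of site-level evidence that a single page cannot provide on its own.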
Similar resources
A New Method for Improving Computational Cost of Open Information Extraction Systems Using Log-Linear Model
Information extraction (IE) is the process of automatically producing a structured representation from unstructured or semi-structured text. It is a long-standing challenge in natural language processing (NLP), intensified by the growing volume, heterogeneity, and unstructured form of information. One of the core information extraction tasks is relation extraction wh...
Agricultural Knowledge Discovery from Semi-Structured Text
This research aims to develop an automatic knowledge discovery system for semi-structured Thai text to support plant diagnosis. Plant disease diagnosis is very important for farmers, who need to treat infected plants before infections become more severe. Prior to diagnosis, farmers need to gain knowledge retrieved primarily from text, including unstructured and semi-structured documents. As th...
High-Precision Web Extraction Using Site Knowledge
In this paper, we study the problem of extracting structured records from semi-structured Web pages. Existing Web information extraction techniques like wrapper induction require a large amount of editorial effort for annotating pages. Other schemes based on Conditional Random Fields (CRFs) suffer from precision loss due to variable site structures and abundance of noise in Web pages. In this p...
Knowledge Base Augmentation using Tabular Data
Large linked data repositories have been built by leveraging semi-structured data in Wikipedia (e.g., DBpedia) and through extracting information from natural language text (e.g., YAGO). However, the Web contains many other vast sources of linked data, such as structured HTML tables and spreadsheets. Often, the semantics in such tables is hidden, preventing one from extracting triples from them...
A Fuzzy Approach for Pertinent Information Extraction from Web Resources
Recent work in machine learning for information extraction has focused on two distinct sub-problems: the conventional problem of filling template slots from natural language text, and the problem of wrapper induction, learning simple extraction procedures (“wrappers”) for highly structured text such as Web pages. For suitable regular domains, existing wrapper induction algorithms can efficientl...